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Abstract 

Sensitivity derivative (SD) calcula- 
tion via automatic differentiation (AD) 
typical of that required for the aerody- 
namic design of a transport-type aircraft 
is considered. Two ways of computing SD 
via code generated by the ADIFOR auto- 
matic differentiation tool are compared 
for efficiency and applicability to prob- 
lems involving large numbers of design 
variables. A vector implementation on 
a Cray Y-MP computer is compared with 
a coarse-grained parallel implementation 
on an IBM SP1 computer, employing a For- 
tran M wrapper. The SD are computed 
for a swept transport wing in turbulent, 
transonic flow; the number of geometric 
design variables varies from 1 to 60 with 
coupling between a wing grid generation 
program and a state-of-the-art, 3-D com- 
putational fluid dynamics program, both 
augmented for derivative computation via 
AD. For a small number of design vari- 
ables, the Cray Y-MP implementation is 
much faster. As the number of design 
variables grows, however, the IBM SP1 be- 
comes an attractive alternative in terms 
of compute speed, job turnaround time, 
and total memory available for solutions 
with large numbers of design variables. 
The coarse-grained parallel implementa- 
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tion also can be moved easily to a network 
of workstations. 

Nomenclature 


Symbols 


C D 

Wing drag coefficient 

C l 

Wing lift coefficient 

Cm 

Wing pitching moment 
coefficient 

F 

Function viewed as a 
mapping 

Fj 

Objective functions 

G 

Generic geometric variable 

J 

Jacobian 

S 

Seed matrix 

SD ;j 

Sensitivity derivative 
matrix 

X 

Generic grid coordinate 

Xi 

Design variables 

cmax 

Airfoil section maximum 
camber 

e_i 

i-th canonical unit vector 

f- g, h 

Generic functions 

r, s, t, u 

Generic variables 

t/c 

Airfoil section 
thickness-to-chord ratio 

xcmax 

Airfoil section location of 
maximum camber 

x, y, z 

grid coordinates 
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Acronyms 


AD 

Automatic differentiation 

.AD 

AD generated code 

ADIFOR 

AD of Fortran 

CFD 

Computational fluid 
dynamics 

DD 

Divided differences 

MDO 

Multidisciplinary design 
optimization 

MIMD 

Multiple instruction 
multiple data 

MW 

Megawords of computer 
memory 

NDV 

Number of design 
variables 

NP 

Number of processors 

SD 

Sensitivity derivative(s) 

TLNS3D 

3-D Navier-Stokes CFD 
program 

WINGDES 

Linear aerodyamics 
program 

WTCO 

Wing grid-generation 
program 


Introduction 


The realistic multidisciplinary de- 
sign optimization (MDO) of advanced air- 
craft using state-of-the-art computers is 
a challenging problem from both a physi- 
cal modeling and a computer science point 
of view. To produce an efficient air- 
craft design, many trade-offs are nec- 
essary among the various physical design 
variables. Similarly, to produce an ef- 
ficient design scheme, many trade-offs 
are necessary among the various MDO im- 
plementation options. In this paper, 
we examine the effects of vectorization 
and coarse-grained parallelization on a 
sensitivity derivative (SD) calculation 
using a representative sample taken from 
a transonic transport design problem. 

A typical MDO problem involves the 
minimization or maximization of one or 
more objective functions Fj by chang- 
ing many design variables X if subject 


to a number of constraints. The de- 
sign variables are usually inputs to one 
or more computer programs that must be 
executed to simulate the complex phenom- 
enon characterized by the objective func- 
tions. Many MDO schemes require the re- 
peated calculation of a matrix of terms, 
called SD' s. 


which describe how the objective func- 
tions change with respect to the design 
variables . 

For realistic design problems, the 
requirements for the MDO scheme may 
change as the design proceeds. Dur- 
ing the conceptual design, for example, 
low-fidelity physics modeling and low- 
accuracy SD calculations may be accept- 
able or desirable, so that the effects 
of a large number of design variables 
can be examined over a broad domain. 
Later, as the design approaches an opti- 
mum point, fewer variables may be per- 
mitted to change, but the calculations 
may require extremely high degrees of 
accuracy. In addition, as the design 
proceeds, new design variables or ob- 
jective functions may be identified and 
others eliminated. Thus, the SD calcu- 
lation scheme should embody a high degree 
of flexibility and predictable levels of 
accuracy. Furthermore, the SD calcula- 
tion scheme must be efficient, because 
this operation represents a large por- 
tion of the total optimization time. 
Several recent research efforts 1-5 have 
demonstrated the potential for using au- 
tomatic differentiation (AD) to compute 
the SD required for MDO from a variety 
of codes. The AD approach meets the 
requirements for versatility, accuracy, 
and speed in an MDO scheme. 

Other methods for computing the SD 
include divided differences (DD) , hand 
coding of derivatives from analytic re- 
lationships, and symbolic manipulators. 
Each of these methods, however, has one 
or more severe drawbacks with respect to 
the stated requirements for flexibility, 
accuracy, and speed. Moreover, these 
methods become unwieldy as the complex- 
ity and resolution of the physical simu- 
lation or the number of design variables 
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increases. However, as demonstrated in 
this paper, the use of AD provides the 
basis for a scalable SD technology suit- 
able for today's parallel computers. 

The SD calculation relevant to the 
aerodynamic design of a transport-type 
aircraft is considered in this paper. 
An important consideration that must be 
made early in the design process is the 
number and type of design variables and 
constraints that will be allowed, because 
this determines the domain in which the 
design can occur. For example, an ex- 
tremely simple wing can be described by 
just a few parameters: the thickness 
and chord specified at the root and tip 
sections, the span, and the leading-edge 
sweep angle. Similarly, a fuselage with 
circular cross sections can be described 
by the three coefficients of an elliptic 
or parabolic radius distribution along 
the fuselage axis. Such simple wings 
and fuselages may be useful in prelimi- 
nary design studies, but they lack much of 
the detail of commercially available con- 
figurations. A more realistic wing-body 
configuration can easily require about 
200 design variables: 8 variables at 
each of 15 wing sections, plus 3 or 4 
variables at each of 20 fuselage cross 
sections. Additional geometric design 
variables (perhaps a total of 1000) can be 
required to describe the location, shape, 
three-axis orientation, and the size of 
the nacelles, pylons, tails, canards, and 
winglets. Note that the effects of any 
of these aircraft components on the aero- 
dynamic properties, such as the lift and 
drag, are realized within a computational 
fluid dynamics (CFD) code by the flow so- 
lution dependence upon the grid (x,y,z) 
coordinates, and thus any of these com- 
ponents may be treated in the same way 
within the CFD code itself. However, as 
the increase in the number of design vari- 
ables and constraints allows for a more 
refined geometric configuration, the op- 
timization problem also becomes more dif- 
ficult and the storage and compute time 
requirements increase. Although thou- 
sands of variables may ultimately be used 
in future design problems, current prac- 
tice (except for some inverse methods) 
usually restricts the design space to 
just a few; this work employs a design 
domain of up to 60 variables and exam- 


ines the storage and compute time issues 
relevant to the SD calculation via AD on 
vector and parallel computers. 

AD and Derivative Strip-Mining 


Automatic differentiation 6-8 is a 
chain-rule-based technique for evaluat- 
ing the derivatives of functions defined 
by computer programs. Automatic dif- 
ferentiation techniques rely on the fact 
that every function, regardless of how 
complicated, is executed on a computer 
as a (potentially long) sequence of el- 
ementary operations such as addition, 
multiplication, and elementary functions 
(e.g., sine and cosine) . By applying 
the chain rule 

df[g(t), h( t )] _ <9f(s,r) dg(t) df(s,r) dh(t) 
dt c)s dt dr dt 

repeatedly to the composition of those 
elementary operations, the derivative in- 
formation of the function f can be com- 
puted exactly (to machine precision) in 
a completely mechanical fashion. 7-9 

The ADIFOR (Automatic Differentiation 
of FORtran) 9-12 tool has been developed 
jointly by the Mathematics and Computer 
Sciences Division at Argonne National 
Laboratory and the Center for Research 
on Parallel Computation at Rice Univer- 
sity. To apply ADIFOR to a given For- 
tran 77 code, the user need only specify 
those variables that correspond to inde- 
pendent and dependent variables with re- 
spect to differentiation; ADIFOR then de- 
termines which variables require associ- 
ated derivative computations, formulates 
the derivative expressions, and generates 
new Fortran 77 code for the computation 
of both the original simulation and the 
associated derivatives. For the remain- 
der of this paper, an ADIFOR-generated 
code version will be designated with a 
.AD suffix. 

The ADIFOR program provides a flexible 
interface that allows for the computation 
of arbitrary linear combinations of the 
columns of the Jacobian matrix. That 
is, if a Fortran code is considered as 
a mapping F: x — *■ y, with associated 
Jacobian 
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ADIFOR generates code that allows one to 
compute JxS, where the so-called seed 
matrix S is initialized at run time. The 
user is hence able to choose, at run 
time, to either compute all sensitivities 
(i.e., the full Jacobian matrix J) , any 
subset of columns of J, an arbitrary 
Jacobian-vector product, or a compressed 
Jacobian to exploit sparsity . 13 

Automatic differentiation is a power- 
ful technique for obtaining accurate and 
efficient sensitivities of design codes, 
but can be demanding in terms of re- 
sources. If we are interested in the 
computation of number-of-design-variable 
(NDV) sensitivities, then the ADIFOR- 
generated code replaces each statement 
involving derivative-related variables 
with possibly some assignment statements 
and a loop of length NDV. Hence, as a 
rule of thumb, the ADIFOR-generated code 
for the computation of NDV sensitivities 
can require up to NDV times the memory of 
the original simulation. The increase 
in run time is more difficult to pre- 
dict, because ADIFOR need not augment ev- 
ery statement in the original code; how- 
ever, experience has shown that the run 
time of the derivative code can be ex- 
pected generally to increase by a factor 
of somewhat less than NDV. 

Because of memory and computational 
time limitations, a single run of the 
sensitivity-enhanced version of a simu- 
lation code may compute fewer than the 
number of sensitivities desired overall. 
These limitations greatly slow down the 
design/sensitivity analysis cycle time 
with a "human in the loop", because both 
the designer and a complete set of design 
information must be available at the same 
time to permit intelligent processing of 
the data and to allow the cycle to pro- 
ceed. To alleviate this problem, paral- 
lel processing is employed to quickly and 
temporarily increase the resources avail- 
able for sensitivity analysis. That is, 
even though the simulation code is a se- 
rial code, coarse-grained parallelism is 
used in the computation of derivatives. 

To illustrate, assume that we wish to 
compute the Jacobian 

dF 

dX; = i 17 


of a function depending on 17 parame- 
ters . The ADIFOR-generated code al- 
lows us to compute any subset of the 
entries of this gradient through proper 
choice of the seed matrix. Hence, we 
can decrease turnaround time by spawning 
several copies of the ADIFOR-generated 
derivative code to run simultaneously on 
multiple processors, as shown in Figure 
1. Here, e_i denotes the i-th canonical 
unit vector. Hence, the first repli- 
cation of the ADIFOR-generated code will 
compute the first six entries of the gra- 
dient, the second one will compute the 
next six, and the third one will compute 
the remaining five. 

This strip-mining approach is simple; 
it can be compared to running several 
simulations in parallel for DD approxi- 
mations of derivatives, decreasing wall- 
clock time with minimal human effort. It 
can easily be mapped onto any collection 
of compute nodes, from a MIMD-type par- 
allel computer to a heterogeneous work- 
station network. 

To implement this approach, we em- 
ployed the Fortran M system . 14,15 Fortran 
M is a small set of extensions to For- 
tran primarily geared toward coordination 
of relatively large-grained parallel ap- 
plications. Loosely speaking, Fortran M 
can be considered an object-oriented ver- 
sion of Fortran, because Fortran M tasks 
exchange information by sending and re- 
ceiving data across channels. Fortran M 
provides constructs for creating tasks 
and channels, for sending messages on 
channels, for mapping tasks and data to 
processors, and so on. Fortran M was 
chosen for two reasons. First, only min- 
imal modifications were required to the 
original serial Fortran 77 code gener- 
ated by ADIFOR because Fortran 77 is a 
subset of Fortran M. Secondly, Fortran 
M supports parallel execution on hetero- 
geneous workstation networks, the most 
ubiquitous parallel-computing resource. 

Given a serial code for the compu- 
tation of sensitivities, a Fortran M 
master process was created which (given 
the number of parallel nodes NP and the 
number of sensitivities NDV to compute) 
spawns ceiling* (NDV/NP ) clones of the se- 


* The ceiling function is an integer division 
that rounds up instead of down. 
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rial derivative code and communicates to 
each node the particular subset of sen- 
sitivities it must compute. Each slave 
process then reads the appropriate subset 
of the total seed matrix from the disk, 
computes its share of the sensitivities, 
and sends the result back to the master. 
This computational harness is simple and 
adaptable to any other ADIFOR-generated 
derivative code with minimal modifica- 
tions . 

Description of Computer Codes 


The WTCO Wing Grid-Generation Code 


The WTCO code is a batch-mode, al- 
gebraic, transfinite interpolation wing 
grid-generation program. The code can 
generate grids around wings of at least 
two airfoil sections that are described 
by ordinates or NACA four-digit airfoils. 
In this case, the usual NACA four-digit 
airfoil family has been expanded by al- 
lowing the maximum camber (cmax) , the 
location of maximum camber (xcmax) , and 
the thickness ratio (t/c) to be speci- 
fied as real number inputs, rather than 
deduced from an integer designation. A 
true NACA 2412 has a cmax = 0.02, xcmax 
= 0.4, and t/c = 0.12; the WTCO program 
has been modified to allow for small per- 
turbations to this shape, such as cmax 
= 0.02002, xcmax = 0.4004, and t/c = 
0.12012. These modifications were nec- 
essary to perform the DD runs used to 
verify the SD calculations by AD. 3 The 
user specifies spacing constants at sev- 
eral points on the grid boundaries, the 
type of boundary surface, the interior 
interpolation, and the amount of smooth- 
ing required. The WTCO code was chosen 
for use in this study for the following 
reasons. The code has been used reli- 
ably with the flow solver on many previ- 
ous occasions, is noniterative and does 
not include coding that is extraneous to 
the grid generation (such as graphic rep- 
resentation) . The source code necessary 
for AD application was readily available. 
In addition, the SD calculations for up 
to four geometric design variables had 
previously been calculated and verified. 3 
The grid coordinates were differentiated 


using ADIFOR with respect to four section 
parameters: maximum camber, location of 

maximum camber, thickness to chord ra- 
tio, and the section twist angle. The 
total number of design variables can be 
changed by changing the number of input 
sections that describe the wing; in this 
study, up to 15 sections with a total of 
60 design variables were used. 

The TLNS3D Aerodynamic Analysis Code 

The TLNS3D code 16 is a high-fidelity 
aerodynamic computer program that solves 
the time-dependent three-dimensional 
(3-D) thin-layer Navier-Stokes equations 
with a finite-volume formulation. The 
code employs grid sequencing, multi- 
grid, and local time stepping to accel- 
erate convergence and efficiently obtain 
steady-state high Reynolds number tur- 
bulent flow solutions. When temporally 
converged to a steady-state solution, the 
method is globally second-order accurate. 
The TLNS3D code is a central-difference 
code that employs second-order central 
differences for all spatial derivatives 
and employs a blending of scalar second- 
difference and fourth-difference artifi- 
cial dissipation to maintain numerical 
stability. The solution is advanced ex- 
plicitly in time with a five-stage Runge- 
Kutta time-marching algorithm. The code 
includes the Baldwin-Lomax (B-L) turbu- 
lence model. This code has been used 
successfully in a number of applications 
across the flight-speed range from low 
subsonic to hypersonic and for a number 
of flight vehicle types. 

The wing lift, drag, and pitching 
moment coefficients (C L , C D , and C M ) are 
differentiated with respect to the grid 
coordinates with ADIFOR. Coupling between 
WTCO. AD and TLNS3D.AD is accomplished by 
passing as files both the grid and the 
grid SD matrix with respect to the above 
geometric parameters. An application 
of the chain rule within TLNS3D.AD is 
used to calculate an aerodynamic SD with 
respect to a geometric design variable 
(for example, the wing lift coefficient) 

dC L _ dC L dX 

~dCT ~~ ~dX~ ' d(4 

In this equation, G is a generic geo- 
metric variable, is a new code seg- 
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ment produced by an ADIFOR application to 
TLNS3D, and is the grid SD array that 
is input to TLNS3D.AD and used to ini- 
tialize the seed matrix. As discussed in 
Refs. 1 and 2, both the function and its 
derivatives are computed in an iterative 
fashion with the multigrid algorithm, al- 
though other additional current research 
is under way to develop a more efficient 
paradigm for AD application to iterative 
algorithms . 

Note that the TLNS3D.AD cases were not 
run to convergence; such runs may require 
several hundred to perhaps 1000 fine-grid 
multigrid iterations. Instead, the code 
was run only 50 coarse-grid cycles and 
50 fine-grid cycles for the timing stud- 
ies shown. Previous results 2 for such 
cases have indicated that both the func- 
tion (i.e., the C L , C D , and C M ) and the 
derivatives converge monotonically and 
that errors in the forces and moments with 
such short multigrid runs are usually be- 
tween 10% and 50%. The runs are, there- 
fore, representative of those that will 
be made during the conceptual or prelim- 
inary design phases. Even runs as short 
as 10 cycles on the fine grid can be used 
to obtain meaningful timing comparisons, 
if care is taken to exclude nonscalable 
times such as the input/output required 
at the beginning and end of the program. 

The TLNS3D.AD results in Ref. 3 
used a 2-section input (root and tip) in 
WTCO.AD, whereas those reported here used 
up to a 15-section input in WTCO.AD (root, 
tip, and linearly interpolated interme- 
diate sections) . Although the 15-sec- 
tion input described the same physical 
wing as the 2-section input, numerical 
differences in the grid and in the grid 
sensitivity matrix that were passed into 
TLNS3D.AD precluded direct comparison be- 
tween derivatives of the wing lift, drag, 
and pitching moments with respect to root 
and tip geometric variables. The impli- 
cation of this effect for a realistic 
design problem is that the maximum reso- 
lution expected in a simulation should 
be built into the design effort from 
the start and used throughout the design 
process, so that consistent results can 
be obtained from the simulation codes at 
different points during the design ef- 
fort . 


The WINGDES Wing Design Code 


Sensitivity derivatives were also 
computed for WINGDES, 17 a low-fidelity 
(linear) vortex lattice aerodynamic 
method that calculates pressure, lift, 
and drag for wings in subsonic and super- 
sonic aerodynamic flows. In addition, 
for a given planform, WINGDES contains an 
inner design loop that optimizes the wing 
camber distribution. The WINGDES method 
also employs nonlinear corrections to the 
lift and drag to account for attainable 
leading— edge thrust. Unger and Hall b 
report on their experiences in applying 
ADIFOR to the WINGDES code, which was 
provided to the authors for use in this 
parallelization study. 

Description of Computer Techniques 


Cray Y-MP Vector Processor 


The original sequential TLNS3D code 
is an efficient, highly vectorized code. 
The initial version of TLNS3D.AD, how- 
ever, was not nearly as efficient on the 
Cray Y-MP computer. The Cray compiler 
Flowtrace and Loopmark options were used 
to identify the time consuming subrou- 
tines and functions of the AD-generated 
code, as well as "do loops" that did 
not vectorize as well as the correspond- 
ing ones in the original code. The 
derivative code was then manually post- 
processed, and additional compiler op- 
tions were used to restore much of the 
code vectorization . 2 

Table 1 shows the timing for the 
fully vectorized (sequential) TLNS3D.AD 
code for NDV ranging from 1 to 60 using 
a single processor of the NASA Langley 
Research Center Cray Y-MP. Note that 
the time per design variable increases 
sharply from four to six design vari- 
ables and then slowly decreases as the 
number of design variables increases be- 
yond six. This limitation is due to the 
Cray compiler, which will unroll the in- 
nermost loops automatically only up to 
a length of 5. In the original TLNS3D 
code, the innermost loops are those that 
involve the flux and residual computa- 
tions; such loops are rich in opera- 
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tions and vectorize efficiently. How- 
ever, in TLNS3D.AD, the innermost loops 
are those generated by ADIFOR of length 
NDV to calculate the gradient objects; in 
TLNS3D.AD, such loops offer little oppor- 
tunity for efficient vectorization, in 
particular for small NDV. However, as 
the time-per-design variable indicates 
in Table 1, extremely good vectorization 
was achieved with the 48- and 60-design- 
variable cases. 

Another issue with SD calculation on 
the Cray Y-MP is the amount of memory 
required by TLNS3D.AD. With the current 
queues and memory configuration of the 
Cray Y-MP, only about 48 design vari- 
ables on the 97x25x17 grid can be con- 
sidered as a routine job submission; for 
the next finest grid (193x49x33), this 
number would drop to about six design 
variables. Although memory exists on the 
Cray Y-MP to accommodate up to perhaps 
80 design variables on the 97x25x17 grid 
and 10 design variables on the 193x49x33 
grid, use of this much computer memory re- 
quires special permission and a dedicated 
user mode. Even the 48-design-variable 
case shown in Table 1, submitted to the 
125MW 3-hr queue, required more than a 
day of elapsed wall-clock time to obtain 
the SD results because of heavy machine 
use. Results from such a job can be 
run for only about 50 fine-grid cycles 
in this queue. If the job were run to 
convergence, it could require at least 
18 hours of Cray Y-MP execution time in 
addition to the time required in the in- 
put queue. The 60 design variable case 
actually fared better in the input and 
executes queues, because it could only 
be run in a dedicated user mode (usually 
overnight, or over a weekend) . This case 
required special permission, but executed 
reasonably quickly, once permission was 
granted to submit the job. 


Parallel Experiments on IBM SP1 


For the WINGDES.AD code, sensitivi- 
ties were computed of pressure, lift, 
and drag with respect to 7 wing planform 
variables, 14 wing thickness variables, 
and 21 wing camber variables (i.e., 42 
design variables overall) . For the 


TLNS3D.AD code, sensitivities were com- 
puted of lift, drag, and pitching mo- 
ments with respect to the twist, maximum 
camber, location of maximum camber, and 
thickness for essentially every wing sec- 
tion on a 97x25x 17 grid. The goal was 
to compute sensitivities for a configura- 
tion specified by 15 wing sections (i.e., 
60 sensitivities overall) . A more de- 
tailed specification of the wing would be 
useless, because it could not be resolved 
further on the chosen grid. 

Both codes were run in single preci- 
sion on the IBM SP1 parallel computer 
at Argonne National Laboratory. For the 
purpose of this experiment, this arrange- 
ment can be viewed as a collection of IBM 
RS/6000 workstations. Each machine has 
a clock speed of 62.5 MHz, 128-MB main 
memory, and a 32-K data and 32-K instruc- 
tion cache. All nodes are connected over 
a high-speed switch. 18 

Serially, one simulation of WINGDES 
without the computation of any sensitiv- 
ities took 11.1 seconds. As shown in Ta- 
ble 2, the serial ADIFOR-generated code 
took 158.2 seconds to compute all 42 sen- 
sitivities. As discussed previously, we 
used strip-mining for the derivative com- 
putations. For each parallel run, the 
average time taken by each slave process 
and the standard deviation of these times 
is shown. In comparison, the time taken 
by the master process was negligible. By 
using 8 processors, the time for comput- 
ing 42 sensitivity derivatives was re- 
duced from 158 to 42 seconds with mini- 
mal effort. Since the standard deviation 
times are small relative to the average 
process times, it is reasonable to ex- 
pect that all the parallel processes can 
be completed in about the average time 
required for one process. 

The TLNS3D.AD code posed a different 
kind of problem. With a 97x25x17 grid, 
only 12 sensitivities could be computed 
serially on a node before memory limi- 
tations were encountered. However, by 
cloning several of these processes, the 
60 sensitivities for the 15-wing-section 
configuration were computed in the times 
shown in Table 3. The times shown do 
not include the time for running WTCO.AD, 
which was executed serially to create the 
grid sensitivities data and save it on 
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disk. The TLNS3D.AD code was run for 50 
cycles on both the fine and the coarse 
grids. If only one processor were avail- 
able, then the memory limitation would 
require the 60-sensitivity job to be run 
as 5 jobs of 12 sensitivities each (us- 
ing different portions of the seed ma- 
trix) for an estimated run time of 9.5 
hours. By employing 15 processors, the 
same set of sensitivities can be obtained 
in about one hour, which represents a dra- 
matic increase in turnaround time. Note 
that it is not possible to obtain perfect 
speedup, as each slave process not only 
has to compute its set of derivatives, 
but also has to re-execute the simula- 
tion run itself. Tables 2 and 3 also 
show that the standard deviation of pro- 
cessor times is low (i.e., all parallel 
processes essentially finish in the same 
amount of time) . 

Comparison of Cray Y-MP and IBM SP1 


For instructional purposes we have 
compared the behavior of the Cray Y- 
MP and the IBM SP1 in computing various 
numbers of sensitivities. Table 4 shows 
the time per iteration for coarse and 
fine grids as well as the time required 
per sensitivity on the fine grid using 
one node of the SP1. These times were 
obtained from runs of 50 cycles on the 
coarse grid and 10 cycles on the fine 
grid. Only the standard compiler op- 
tions were used to produce the SP1 code 
for which the results are shown; (i.e., 
no hand optimization of the SP1 code was 
performed) . This is significant because 
the Cray can vectorize loops with long 
vectors very efficiently. (See the en- 
tries for 48 or 60 design variables in 
Table 1.) Such loops do not execute 
speedily on the RS/6000 architecture be- 
cause of frequent cache misses. 19 Hence, 
the run times obtained should not be taken 
as an indication of the best performance 
that can be obtained with a code like 
TLNS3D on the IBM platform. 

Figure 2 shows the run time ratio of 
IBM to Cray time per cycle on the coarse 
and fine grids for up to twelve design 
variables. For example, the Cray time 
for 12 design variables is 45.95 sec- 
onds on the fine grid (from Table 1), 


whereas the IBM time is 128.2 seconds 
(from Table 4) for twelve design vari- 
ables; the ratio of IBM to Cray times, 
shown in Figure 2 is about 2.8 for twelve 
design variables on the fine grid. In 
order to interpret Figure 2 correctly, 
recall that ADIFOR maps an assignment 
statement from the original TLNS3D code 
into a sequence of scalar assignments, 
plus a loop of length NDV. For NDV less 
than six, the Cray compiler will unroll 
the short innermost derivative loops and 
vectorize the second-layer long loops; 
thus, for one to five design variables 
the Cray vastly outperforms the IBM. The 
Cray runs with good vector speed because 
the innermost derivative loops are un- 
rolled. The superscalar processor on 
the IBM is kept moderately busy with 
small loops and otherwise is slowed down 
through cache misses induced by the large 
second-layer loops . On the other hand, 
as the number of design variables grows 
from 6 to 12 the superscalar chip on the 
IBM runs more efficiently, whereas the 
Cray has only relatively short loops to 
vectorize. The performance of the Cray 
drops with respect to that of the SP1. 
Also note that the performance ratios on 
the fine and the coarse grids are almost 
identical, as could be expected from a 
multigrid algorithm. 

As the number of design variables con- 
tinues to grow, the Cray vector perfor- 
mance improves, as evidenced by the entry 
in Table 1 for 48 and 60 design vari- 
ables. Several possible scenarios ex- 
ist for computing 60 sensitivities on the 
SP1. The single processor memory lim- 
its a job execution to a maximum of 12 
design variables for the 97x25x17 grid. 
Thus, we contrast the execution of 5 dif- 
ferent jobs of 12 design variables each 
run sequentially on a single SP1 proces- 
sor to the use of multiple SP1 processors 
to accommodate more design variables in 
groups of 12 per processor in parallel. 
In the first case, the Cray performance 
continually improves relative to the SP1 
until all the memory of the Cray is used 
up. In the second case, the IBM time re- 
mains essentially constant as more pro- 
cessors are used in parallel to compute 
blocks of 12 design variables each, but 
the Cray time increases with NDV in non- 
linear way. It is this second scenario 



which is the more interesting of the two 
from a computational standpoint. From 
Table 1, it can be seen that the Cray 
takes 76.92 seconds per fine grid iter- 
ation for NDV = 48. The SP1 is assumed 
to take the same amount of time (128.2 
seconds per fine grid iteration, from Ta- 
ble 4) if NDV is 12 or more, since the 
average standard deviation for multiple 
processors (Table 3) is low compared to 
the compute time for the job and only a 
small portion of the total machine is re- 
quired. Hence, comparing the derivative 
strip-mining with four processors on the 
SP1 with the Cray, the runtime ratio of 
the SP1 to the Cray would be 128.2/76.92 
= 1.67 for NDV=48. Similarly, for NDV 
= 60, the Cray time is 72.17 seconds, 
the SP1 time is again assumed to be 128.2 
seconds, and the run time ratio is 1.78. 
These ratios indicate that the SP1 is a 
viable alternative to the Cray even for 
moderately large numbers of design vari- 
ables . 

The Cray times do not include that 
spent in the input queue, which can be 
a significant portion of the total job 
time, particularly as time and memory re- 
quirements (i.e., the number of design 
variables and/or the spatial resolution 
of the problem) increase. At present, 
no such delays are encountered using the 
SP1; however, such delays may become a 
reality as the parallel machine usage 
increases in the future. At present, 
the parallel scenario (either SP1, or a 
network of workstations) provides an at- 
tractive alternative to computing sensi- 
tivities on the Cray for many cases that 
can be accommodated by both and provides 
total machine resources in excess of the 
Cray Y-MP for larger problems. 

Related and Future Work 

Work is in progress to improve fun- 
damental Fortran 77 technology such as 
ADIFOR and to provide automatic dif- 
ferentiation (AD) functionality for C 
programs, resulting in more efficient 
derivative tools for black-box differ- 
entiation. Techniques are also un- 
der development to exploit algorithmic 
structures (for example, iterative so- 
lution procedures common to many pro- 


grams) in the application of AD tools. 
Exploitation of such structure is ben- 
eficial both to increase the numeri- 
cal robustness of the derivative code 
and to speed up derivative computation 20 . 
This approach is closely related to the 
incremental iterative form for comput- 
ing CFD sensitivities 21-23 and applicable 
to computing the sensitivity of optimal 
solutions . 24, 25 

Further parallelism can be exposed 
by exploiting the associativity of the 
derivative chain rule. Consider the 
simulation illustrated in Figure 3 con- 
sisting of three codes, denoted by f, 
g, and h, which may represent different 
disciplines. Code g cannot start be- 
fore f has finished, and h depends on g' s 
output. To get the desired sensitivi- 
ties qp , the serial code for computing 
this derivative could be strip-mined as 
we have done here. Spawning processes 
to separately compute qp,qp,and qp and then 
multiplying these matrices to obtain the 
desired sensitivities is an even better 
technique because the job will execute 
in less time and utilize the machine re- 
sources more effectively. The evalua- 
tion of f will be completed before the 
process that is computing qp has finished; 
thus, another process to compute qp can 
follow the first completed process. As 
a result, the individual sensitivity ma- 
trices qp,qp,andqp will be computed in a 
staggered fashion, in parallel, as shown 
in Figure 4 . Note that in order to break 
the serial dependence in the derivative 
computation, we have to evaluate f and 
g twice, once in the code that computes 
the derivatives qp and qp, and once in the 
"function evaluation" chain, to generate 
the input needed for the next derivative 
process. We also note that the strip- 
mining technique can be applied to each 
of these processes as well. 

In this paper, Fortran M was applied 
after the ADIFOR processing, but tools 
like ADIFOR must be provided for codes 
that are written in parallel languages 
such as Fortran M or High Performance 
Fortran. This avenue is particularly 
critical for high-fidelity codes because 
an AD tool is unlikely to generate sen- 
sitivities with less memory or time than 
the original simulation. No work has yet 
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been attempted in applying AD to a simu- 
lation code that employs explicit message 
passing calls to achieve fine-grained 
parallelization because the present AD- 
IFOR tool does not have the capability 
to imbed the necessary message passing 
constructs which would allow for a fine- 
grained parallelization of the derivative 
calculations. This capability may be in- 
corporated within future versions of the 
ADIFOR tool. 

Conclusions 


This work demonstrates the potential 
of using coarse-grained parallel com- 
puting techniques to reduce the execu- 
tion time required in the computation of 
many design sensitivities for multidis- 
ciplinary problems via automatic differ- 
entiation. The blending of the automatic 
differentiation and the use of a strip- 
mining parallel computation technique, 
under the coordination of a Fortran M 
master process, provides a fast, effi- 
cient, and easily implemented means to 
compute a large number of sensitivities 
as may be required for typical aircraft 
design problems. Significant speed ups 
were achieved for both a vortex lattice 
code (WINGDES) and a thin-layer Navier- 
Stokes code (TLNS3D) for transonic flow. 

Comparisons were also made between 
the coarse-grained parallel implementa- 
tion and a traditional vector supercom- 
puter implementation of the derivative 
computation for up to 60 design vari- 
ables. For up to 5 design variables, 
the vector supercomputer implementation 
is much faster, but for 6 or more de- 
sign variables, the coarse-grained par- 
allel implementation is nearly as fast in 
execution time, and can be much faster in 
elapsed job turn around time. The mem- 
ory and queue limitations on the vector 
computer implementation also may also se- 
verely restrict the size or length of the 
calculation which can be done, relative 
to the parallel computer implementation. 
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Table 1. Cray Y-MP Single Processor 
Timing (seconds) for TLNS3D.AD Code with 
Two-Level Multigrid on 97x25x17 Viscous 
Grid 


NDV 

Time/Iter. 

(coarse 

grid) 

Time/Iter, 
(fine grid) 

Time/NDV 
(fine grid) 

01 

0.51 

05.08 

5.08 

02 

0.70 

07.19 

3.59 

04 

1.09 

11.85 

2.96 

06 

3.63 

40.68 

6.78 

08 

3.74 

44.59 

5.57 

10 

3.89 

43.76 

4.38 

12 

4.02 

45.95 

3.83 

48 

6.62 

76.92 

1.60 

60 

6.19 

72.17 

1.20 


Table 2. IBM SP1 Multiprocessor 
Timing (seconds) for WINGDES.AD Code 


Number of 
processors 

Average time 

Standard 

deviation 

1 

158.2 

0.00 

2 

92.0 

2.19 

4 

58.4 

1.83 

8 

41.8 

1.09 


Table 3. IBM SP1 Multiprocessor 
Timing (seconds) for TLNS3D.AD Code with 
Two-Level Multigrid on 97x25x17 Viscous 
Grid (NDV=60 ) 


Number of 
processors 

Average time 

Standard 

deviation 

01 

34,160 

(estimated) 

0.00 

05 

6,832 

53.1 

15 

3,757 

11.5 


Table 4. IBM SP1 Single-Processor 
Timing (seconds) for TLNS3D.AD Code with 
Two-Level Multigrid on 97x25x17 Viscous 
Grid 


NDV 

Time/iter. 

(coarse 

grid) 

Time/iter, 
(fine grid) 

Time/NDV 
(fine grid) 

01 

4.34 

51.5 

51.5 

02 

5.00 

58.7 

29.4 

04 

6.08 

66.9 

16.7 

06 

7.54 

82.1 

13.7 

08 

8.74 

97.4 

12.2 

10 

9.76 

108.5 

10.8 

12 

10.76 

128.2 

10.7 
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Figure 1: An Example of Parallel Sensitivity Computation through Derivative 
Stripmining 


Comparison between Cray Y/MP and a Node of the IBM SP/1 



Figure 2: Cray vs. IBM Performance. 
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Figure 3: A Sequence of Simulation Modules 



Figure 4: Exploiting Chain Rule Associativity to Break Dependencies 
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